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Abstract 


We analyze two communication-efficient algorithms for distributed optimization in statistical set- 
tings involving large-scale data sets. The first algorithm is a standard averaging method that 
distributes the N data samples evenly to m machines, performs separate minimization on each 
subset, and then averages the estimates. We provide a sharp analysis of this average mixture 
algorithm, showing that under a reasonable set of conditions, the combined parameter achieves 
mean-squared error (MSE) that decays as O(N~! + (N/m)~*). Whenever m < VN, this guaran- 
tee matches the best possible rate achievable by a centralized algorithm having access to all N 
samples. The second algorithm is a novel method, based on an appropriate form of bootstrap 
subsampling. Requiring only a single round of communication, it has mean-squared error that 
decays as O(N! + (N/m)~3), and so is more robust to the amount of parallelization. In ad- 
dition, we show that a stochastic gradient-based method attains mean-squared error decaying as 
O(N! + (N/m)~?/2), easing computation at the expense of a potentially slower MSE rate. We 
also provide an experimental evaluation of our methods, investigating their performance both on 
simulated data and on a large-scale regression problem from the internet search domain. In particu- 
lar, we show that our methods can be used to efficiently solve an advertisement prediction problem 
from the Chinese SoSo Search Engine, which involves logistic regression with N œ 2.4 x 108 sam- 
ples and d ~ 740,000 covariates. 

Keywords: distributed learning, stochastic optimization, averaging, subsampling 


1. Introduction 


Many procedures for statistical estimation are based on a form of (regularized) empirical risk min- 
imization, meaning that a parameter of interest is estimated by minimizing an objective function 
defined by the average of a loss function over the data. Given the current explosion in the size and 
amount of data available in statistical studies, a central challenge is to design efficient algorithms for 
solving large-scale problem instances. In a centralized setting, there are many procedures for solv- 
ing empirical risk minimization problems, among them standard convex programming approaches 
(e.g., Boyd and Vandenberghe, 2004) as well as stochastic approximation and optimization algo- 
rithms (Robbins and Monro, 1951; Hazan et al., 2006; Nemirovski et al., 2009). When the size of 
the data set becomes extremely large, however, it may be infeasible to store all of the data on a 
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single computer, or at least to keep the data in memory. Accordingly, the focus of this paper is the 
study of some distributed and communication-efficient procedures for empirical risk minimization. 

Recent years have witnessed a flurry of research on distributed approaches to solving very large- 
scale statistical optimization problems. Although we cannot survey the literature adequately—the 
papers Nedić and Ozdaglar (2009), Ram et al. (2010), Johansson et al. (2009), Duchi et al. (2012a), 
Dekel et al. (2012), Agarwal and Duchi (2011), Recht et al. (2011), Duchi et al. (2012b) and ref- 
erences therein contain a sample of relevant work—we touch on a few important themes here. It 
can be difficult within a purely optimization-theoretic setting to show explicit benefits arising from 
distributed computation. In statistical settings, however, distributed computation can lead to gains in 
computational efficiency, as shown by a number of authors (Agarwal and Duchi, 2011; Dekel et al., 
2012; Recht et al., 2011; Duchi et al., 2012b). Within the family of distributed algorithms, there can 
be significant differences in communication complexity: different computers must be synchronized, 
and when the dimensionality of the data is high, communication can be prohibitively expensive. It 
is thus interesting to study distributed estimation algorithms that require fairly limited synchroniza- 
tion and communication while still enjoying the greater statistical accuracy that is usually associated 
with a larger data set. 

With this context, perhaps the simplest algorithm for distributed statistical estimation is what we 
term the average mixture (AVGM) algorithm. It is an appealingly simple method: given m different 
machines and a data set of size N, first assign to each machine a (distinct) data set of size n = N/m, 
then have each machine i compute the empirical minimizer 9; on its fraction of the data, and finally 
average all the parameter estimates 8; across the machines. This approach has been studied for some 
classification and estimation problems by Mann et al. (2009) and McDonald et al. (2010), as well 
as for certain stochastic approximation methods by Zinkevich et al. (2010). Given an empirical risk 
minimization algorithm that works on one machine, the procedure is straightforward to implement 
and is extremely communication efficient, requiring only a single round of communication. It is 
also relatively robust to possible failures in a subset of machines and/or differences in speeds, since 
there is no repeated synchronization. When the local estimators are all unbiased, it is clear that the 
the AVGM procedure will yield an estimate that is essentially as good as that of an estimator based 
on all N samples. However, many estimators used in practice are biased, and so it is natural to ask 
whether the method has any guarantees in a more general setting. To the best of our knowledge, 
however, no work has shown rigorously that the AVGM procedure generally has greater efficiency 
than the naive approach of using n = N/m samples on a single machine. 

This paper makes three main contributions. First, in Section 3, we provide a sharp analysis of 
the AVGM algorithm, showing that under a reasonable set of conditions on the population risk, it 
can indeed achieve substantially better rates than the naive approach. More concretely, we provide 
bounds on the mean-squared error (MSE) that decay as O((nm)~!+n~?). Whenever the number 
of machines m is less than the number of samples n per machine, this guarantee matches the best 
possible rate achievable by a centralized algorithm having access to all N = nm samples. In the 
special case of optimizing log likelihoods, the pre-factor in our bound involves the trace of the 
Fisher information, a quantity well-known to control the fundamental limits of statistical estimation. 
We also show how the result extends to stochastic programming approaches, exhibiting a stochastic 
gradient-descent based procedure that also attains convergence rates scaling as O((nm)~'), but with 
slightly worse dependence on different problem-specific parameters. 

Our second contribution is to develop a novel extension of simple averaging. It is based on an 
appropriate form of resampling (Efron and Tibshirani, 1993; Hall, 1992; Politis et al., 1999), which 
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we refer to as the subsampled average mixture (SAVGM) approach. At a high level, the SAVGM 
algorithm distributes samples evenly among m processors or computers as before, but instead of 
simply returning the empirical minimizer, each processor further subsamples its own data set in 
order to estimate the bias of its own estimate, and returns a subsample-corrected estimate. We 
establish that the SAVGM algorithm has mean-squared error decaying as O(m~!n~!+n~°). As long 
as m <n’, the subsampled method again matches the centralized gold standard in the first-order 
term, and has a second-order term smaller than the standard averaging approach. 

Our third contribution is to perform a detailed empirical evaluation of both the AVGM and 
SAVGM procedures, which we present in Sections 4 and 5. Using simulated data from normal and 
non-normal regression models, we explore the conditions under which the SAVGM algorithm yields 
better performance than the AVGM algorithm; in addition, we study the performance of both meth- 
ods relative to an oracle baseline that uses all N samples. We also study the sensitivity of the algo- 
rithms to the number of splits m of the data, and in the SAVGM case, we investigate the sensitivity of 
the method to the amount of resampling. These simulations show that both AVGM and SAVGM have 
favourable performance, even when compared to the unattainable “gold standard” procedure that 
has access to all N samples. In Section 5, we complement our simulation experiments with a large 
logistic regression experiment that arises from the problem of predicting whether a user of a search 
engine will click on an advertisement. This experiment is large enough—involving N ~ 2.4 x 108 
samples in d ~ 740,000 dimensions with a storage size of approximately 55 gigabytes—that it is 
difficult to solve efficiently on one machine. Consequently, a distributed approach is essential to 
take full advantage of this data set. Our experiments on this problem show that SAVGM—with the 
resampling and correction it provides—gives substantial performance benefits over naive solutions 
as well as the averaging algorithm AVGM. 


2. Background and Problem Set-up 


We begin by setting up our decision-theoretic framework for empirical risk minimization, after 
which we describe our algorithms and the assumptions we require for our main theoretical results. 


2.1 Empirical Risk Minimization 


Let {f(-;x), x E€ X} be a collection of real-valued and convex loss functions, each defined on a set 
containing the convex set © C R°. Let P be a probability distribution over the sample space X. 
Assuming that each function x ++ f (8; x) is P-integrable, the population risk Fo : © — R is given by 


Fo(@) = Eelf(@:X)] = | £(@:x)dP(). 














Our goal is to estimate the parameter vector minimizing the population risk, namely the quantity 
6* := argmin Fo(8) = argmin | f(0;x)dP(x), 
oco OEO 


which we assume to be unique. In practice, the population distribution P is unknown to us, but we 
have access to a collection S of samples from the distribution P. Empirical risk minimization is 
based on estimating 0* by solving the optimization problem 


8 € argmin { 5 (0:s)}. 


oco XES 
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Throughout the paper, we impose some regularity conditions on the parameter space, the risk 
function Fo, and the instantaneous loss functions f(-;x) : © — R. These conditions are standard in 
classical statistical analysis of M-estimators (e.g., Lehmann and Casella, 1998; Keener, 2010). Our 
first assumption deals with the relationship of the parameter space to the optimal parameter 0*. 


Assumption 1 (Parameters) The parameter space © C R! is a compact convex set, with ®* € int® 
and ¢y-radius R = max || — ©* ||>. 
Oco 


In addition, the risk function is required to have some amount of curvature. We formalize this notion 
in terms of the Hessian of Fọ: 


Assumption 2 (Local strong convexity) The population risk is twice differentiable, and there ex- 
ists a parameter A > 0 such that V?Fo(0*) = Maxa. 


Here V? (0) denotes the d x d Hessian matrix of the population objective Fo evaluated at 0, and 
we use = to denote the positive semidefinite ordering (i.e., A = B means that A — B is positive 
semidefinite.) This local condition is milder than a global strong convexity condition and is required 
to hold only for the population risk Fo evaluated at 0*. It is worth observing that some type of 
curvature of the risk is required for any method to consistently estimate the parameters 0*. 


2.2 Averaging Methods 


Consider a data set consisting of N = mn samples, drawn i.i.d. according to the distribution P. In 
the distributed setting, we divide this N-sample data set evenly and uniformly at random among a 
total of m processors. (For simplicity, we have assumed the total number of samples is a multiple 
of m.) For i=1,...,m, we let Sı; denote the data set assigned to processor i; by construction, it 
is a collection of n samples drawn i.i.d. according to P, and the samples in subsets $1 ; and S1; are 
independent for i Æ j. In addition, for each processor i we define the (local) empirical distribution 
P; ; and empirical objective Fy ; via 


Pii = DT 
With this notation, the AVGM algorithm is very simple to describe. 


2.2.1 AVERAGE MIXTURE ALGORITHM 


(1) For each i € {1,...,m}, processor i uses its local data set S4 ; to compute the local empirical 
minimizer 
ORES argmin { a L f(0;x) \. (1) 
eco l [Siu ES; 
F; ;(8) 


(2) These m local estimates are then averaged together—that is, we compute 


I 


3 


ee ip (2) 


1 


3|- 


i 
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The subsampled average mixture (S AVGM) algorithm is based on an additional level of sampling 
on top of the first, involving a fixed subsampling rate r € [0,1]. It consists of the following additional 
steps: 


2.2.2 SUBSAMPLED AVERAGE MIXTURE ALGORITHM 


(1) Each processor i draws a subset S2; of size [rn] by sampling uniformly at random without 
replacement from its local data set $4 ;. 


(2) Each processor i computes both the local empirical minimizers 6, ; from Equation (1) and the 
empirical minimizer 


02; € argmin { E Ł f(8;x) k 


oco [S2;l xES2,j 


—— mamm 
F> ;(8) 


(3) In addition to the previous average (2), the SAVGM algorithm computes the bootstrap average 
05 := tyr 8. ;, and then returns the weighted combination 


0; = r0> 


l-r 





(3) 


Osavom i= 


The intuition for the weighted estimator (3) is similar to that for standard bias correction pro- 
cedures using the bootstrap or subsampling (Efron and Tibshirani, 1993; Hall, 1992; Politis et al., 
1999). Roughly speaking, if bp = 0* — 0, is the bias of the first estimator, then we may approximate 
bo by the subsampled estimate of bias bı = 0* — 02. Then, we use the fact that bı ~ bo/r to argue 
that 6* ~ (0; — r02)/(1 — r). The re-normalization enforces that the relative “weights” of 8, and 05 
sum to 1. 


The goal of this paper is to understand under what conditions—and in what sense—the estima- 
tors (2) and (3) approach the oracle performance, by which we mean the error of a centralized risk 
minimization procedure that is given access to all N = nm samples. 


2.2.3 NOTATION 


Before continuing, we define the remainder of our notation. We use £2 to denote the usual Euclidean 
1 : ae : . 
norm ||6||, = (ee 1 07)2. The -operator norm of a matrix A € R@*@ is its maximum singular 
value, defined by 
I|Alll,:= sup |/Avf2. 


vEeR®,||v|]><1 


A convex function F is A-strongly convex on a set U C Rf if for arbitrary u,v € U we have 
A 2 
F(u) > F(v) +(VF(v),u—v) +5 llu- vlz- 


(If F is not differentiable, we may replace VF with any subgradient of F.) We let & denote the 
Kronecker product, and for a pair of vectors u,v, we define the outer product u & v = uv! . Fora 
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three-times differentiable function F, we denote the third derivative tensor by V?F, so that for each 
u € domF the operator V>F (u) : RI% — R! is linear and satisfies the relation 


nee eee 
[V F(u)(v@v)], oa pa (aa) VjVk. 


We denote the indicator function of an event £ by 1g), which is 1 if £ is true and 0 otherwise. 


3. Theoretical Results 


Having described the AVGM and SAVGM algorithms, we now turn to statements of our main theo- 
rems on their statistical properties, along with some consequences and comparison to past work. 


3.1 Smoothness Conditions 


In addition to our previously stated assumptions on the population risk, we require regularity con- 
ditions on the empirical risk functions. It is simplest to state these in terms of the functions 
© — f(0;x), and we note that, as with Assumption 2, we require these to hold only locally around 
the optimal point 6*, in particular within some Euclidean ball U = {0 € R? | ||6*—6]|, <p} c @ 
of radius p > 0. 


Assumption 3 (Smoothness) There are finite constants G,H such that the first and the second 
partial derivatives of f exist and satisfy the bounds 














E[||VF(0;X) ||] < G? and E||)|V2#(0;X) — V7F(6)|||5] < H8 for all @ €U. 




















In addition, for any x € X, the Hessian matrix V? f (8;x) is L(x)-Lipschitz continuous, meaning that 


\|| V7,(0'sx) — V7 f(8;x)|||, < L(x) l-el], for all 8,0’ EU. (4) 






































We require that E[L(X)*] < LÈ and E[(L(X) —E|L(X)])8] < LÈ for some finite constant L. 


It is an important insight of our analysis that some type of smoothness condition on the Hessian 
matrix, as in the Lipschitz condition (4), is essential in order for simple averaging methods to work. 
This necessity is illustrated by the following example: 


Example 1 (Necessity of Hessian conditions) Let X be a Bernoulli variable with parameter | 


2? 
and consider the loss function 


02—90 ifx=0 


5 
87le<o) +8 ifx=1, (a) 


f(8;x) = l 
where 1(ọ<o) is the indicator of the event {0 < 0}. The associated population risk is Fo(®) = 
1(0° + 871 (9<0)). Since |Fj(w) — Fy(v)| < 2|w— v], the population risk is strongly convex and 
smooth, but it has discontinuous second derivative. The unique minimizer of the population risk is 
@* = 0, and by an asymptotic expansion given in Appendix A, it can be shown that E[®, j] = Q(n~2). 














Consequently, the bias of 01 is Q(n-2 ), and the AVGM algorithm using N = mn observations must 
suffer mean squared error E|(®; — 9*)”] = Q(n7!). 
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The previous example establishes the necessity of a smoothness condition. However, in a certain 
sense, it is a pathological case: both the smoothness condition given in Assumption 3 and the local 
strong convexity condition given in Assumption 2 are relatively innocuous for practical problems. 
For instance, both conditions will hold for standard forms of regression, such as linear and logistic, 
as long as the population data covariance matrix is not rank deficient and the data has suitable 
moments. Moreover, in the linear regression case, one has L = 0. 


3.2 Bounds for Simple Averaging 


We now turn to our first theorem that provides guarantees on the statistical error associated with the 
AVGM procedure. We recall that 6* denotes the minimizer of the population objective function Fo, 
and that for each i € {1,...,m}, we use S; to denote a data set of n independent samples. For each i, 
we use 0; € argmingcg{ } Yxcs, f (0;x)} to denote a minimizer of the empirical risk for the data set 
Si, and we define the averaged vector 6 = 1 ™ ,9;. The following result bounds the mean-squared 
error between this averaged estimate and the minimizer 0* of the population risk. 


Theorem 1 Under Assumptions 1 through 3, the mean-squared error is upper bounded as 


:|@-e"|)] < Że] 


LG 
2 

+535 7 (4 logd+ E ) 
+ O(m ~n?) + O(n), 























V2P(e) v(e: x)|] (6) 


$ 
e| 


























VRTV] 


where c is a numerical constant. 


A slightly weaker corollary of Theorem 1 makes it easier to parse. In particular, note that 


(ü) 1 
IRETI < PREIA < E OD 


where step (i) follows from the inequality |||Ax|||> < |||A||| ||x|],, valid for any matrix A and vector x; 
and step (ii) follows from Assumption 2. In addition, Assumption 3 implies E[|||V f (9%; X) 13] <G’, 
and putting together the pieces, we have established the following. 














Corollary 2 Under the same conditions as Theorem 1, 


2G? cG LG? 
z [10-04] < H os (toga + 3 ) 00m ln?) + O(n). (8) 




















This upper bound shows that the leading term decays proportionally to (nm)~!, with the pre-factor 
depending inversely on the strong convexity constant À and growing proportionally with the bound 
G on the loss gradient. Although easily interpretable, the upper bound (8) can be loose, since it is 
based on the relatively weak series of bounds (7). 


The leading term in our original upper bound (6) involves the product of the gradient V f (0*; X) 


with the inverse Hessian. In many statistical settings, including the problem of linear regression, 
the effect of this matrix-vector multiplication is to perform some type of standardization. When the 
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loss f(-;x) : © — Ris actually the negative log-likelihood ¢(x | 0) for a parametric family of models 
{Po}, we can make this intuition precise. In particular, under suitable regularity conditions (e.g., 
Lehmann and Casella, 1998, Chapter 6), we can define the Fisher information matrix 


























1(0*) :=E [vex | 8*) VEX | o) = E[V¢(X | 0*)]. 


Recalling that N = mn is the total number of samples available, let us define the neighbourhood 
B2(0,t) := {0 € R° : ||6’—6]], < t}. Then under our assumptions, the Hájek-Le Cam minimax 
theorem (van der Vaart, 1998, Theorem 8.11) guarantees for any estimator Oy based on N samples 
that 





lim liminf sup NEg [lx — 95] > tr(I(6*)~'). 
c>% NI Be (Or eM) i 


In connection with Theorem 1, we obtain: 











Corollary 3 In addition to the conditions of Theorem 1, suppose that the loss functions f(-;x) are 

the negative log-likelihood (x | 0) for a parametric family {Pg,® € ©}. Then the mean-squared 

error is upper bounded as 

LG? 
A2 


cm? tr(1(0*)~!) 
WN? 

















E |||: —8*||3] < Z aq) + (Hoga + ) + O(mN~), 


where c is a numerical constant. 


Proof Rewriting the log-likelihood in the notation of Theorem 1, we have V£(x | 0*) = V f(0*;x) 
and all we need to note is that 


(0)! =E [r07 "Vex | O*)VE(X | eTe] 

















= E| (V7F(6")!Vf(6sX)) (VF) V F(X) |. 





Now apply the linearity of the trace and use the fact that tr(uu') = |lu|5. E 


Except for the factor of two in the bound, Corollary 3 shows that Theorem 1 essentially achieves 
the best possible result. The important aspect of our bound, however, is that we obtain this conver- 
gence rate without calculating an estimate on all N = mn samples: instead, we calculate m inde- 
pendent estimators, and then average them to attain the convergence guarantee. We remark that an 
inspection of our proof shows that, at the expense of worse constants on higher order terms, we can 
reduce the factor of 2/mn on the leading term in Theorem | to (1 +c)/mn for any constant c > 0; 
as made clear by Corollary 3, this is unimprovable, even by constant factors. 

As noted in the introduction, our bounds are certainly to be expected for unbiased estimators, 
since in such cases averaging m independent solutions reduces the variance by 1 /m. In this sense, 
our results are similar to classical distributional convergence results in M-estimation: for smooth 
enough problems, M-estimators behave asymptotically like averages (van der Vaart, 1998; Lehmann 
and Casella, 1998), and averaging multiple independent realizations reduces their variance. How- 
ever, it is often desirable to use biased estimators, and such bias introduces difficulty in the analysis, 
which we explore more in the next section. We also note that in contrast to classical asymptotic re- 
sults, our results are applicable to finite samples and give explicit upper bounds on the mean-squared 
error. Lastly, our results are not tied to a specific model, which allows for fairly general sampling 
distributions. 
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3.3 Bounds for Subsampled Mixture Averaging 


When the number of machines m is relatively small, Theorem 1 and Corollary 2 show that the 
convergence rate 2 the AVGM algorithm is mainly determined by the first term in the bound (6), 
which is at most re . In contrast, when the number of processors m grows, the second term in the 





bound (6), in spite oF being O(n7?), may have non-negligible effect. This issue is exacerbated when 
the local strong convexity parameter A of the risk Fo is close to zero or the Lipschitz continuity con- 
stant H of Vf is large. This concern motivated our development of the subsampled average mixture 
(SAVGM) algorithm, to which we now return. 


Due to the additional randomness introduced by the subsampling in SAVGM, its analysis requires 
an additional smoothness condition. In particular, recalling the Euclidean p-neighbourhood U of the 
optimum 8*, we require that the loss function f is (locally) smooth through its third derivatives. 


Assumption 4 (Strong smoothness) For each x € X, the third derivatives of f are M(x)-Lipschitz 
continuous, meaning that 


I| (V3 F(0:x) — V3 £(6'sx)) (u@u)||, < M(x) ||o—6'|, llul for all 0,0’ € U, and u € RË, 





where E[M8(X)| < M® for some constant M < œ. 











It is easy to verify that Assumption 4 holds for least-squares regression with M = 0. It also holds 
for various types of non-linear regression problems (e.g., logistic, multinomial etc.) as long as the 
covariates have finite eighth moments. 

With this set-up, our second theorem establishes that bootstrap sampling yields improved per- 
formance: 


Theorem 4 Under Assumptions 1 through 4, the output Bsavom = (81 — r82)/(1 — r) of the boot- 
strap SAVGM algorithm has mean-squared error bounded as 





























E [[Bswvon— 8] < Gas TE | 


r)? nm 


[MG | GIL?dlogd 1 e ee ae 
re\ 4S Te ae Gene N 


for a numerical constant c. 


VRTY] (9) 





Comparing the conclusions of Theorem 4 to those of Theorem 1, we see that the the O(n”) 
term in the bound (6) has been eliminated. The reason for this elimination is that subsampling 
at a rate r reduces the bias of the SAVGM algorithm to O(n), whereas in contrast, the bias 
of the AVGM algorithm induces terms of order n~*. Theorem 4 suggests that the performance 
of the SAVGM algorithm is affected by the subsampling rate r; in order to minimize the upper 
bound (9) in the regime m < N?’?, the optimal choice is of the form r œ C ym/n = Cm?/? /N where 
C ~ (G?/A*) max{MG/A, L\/dlogd}. Roughly, as the number of machines m becomes larger, we 
may increase r, since we enjoy averaging affects from the SAVGM algorithm. 

Let us consider the relative effects of having larger numbers of machines m for both the AVGM 
and SAVGM algorithms, which provides some guidance to selecting m in practice. We define 0? = 
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E[||V7Fo(0*)~!V f(0*;X) iA to be the asymptotic variance. Then to obtain the optimal convergence 
rate of o”/N, we must have 














2 
max{H? logd,L?G? /i7} 





1 2 221M 9 i 
zz max {H logd,L*G } x0 SW or m< N? (10) 


in Theorem 1. Applying the bound of Theorem 4, we find that to obtain the same rate we require 


WIN 








M?G? Ldlogd) Gm? | (1+r)o? 
max ; < 
A6 A4 rN? N 


Mr(1+r)o? ) 5 


<N? 
S, ( E 


Now suppose that we replace r with Cm3/? /N as in the previous paragraph. Under the conditions 
o? ~ G? and r = o(1), we then find that 


vAN 
WIN 








4262m3/2 3 2 3 

eae (z= meats) ae (= OAE] eee 
Comparing inequalities (10) and (11), we see that in both cases m may grow polynomially with 
the global sample size N while still guaranteeing optimal convergence rates. On one hand, this 
asymptotic growth is faster in the subsampled case (11); on the other hand, the dependence on the 
dimension d of the problem is more stringent than the standard averaging case (10). As the local 
strong convexity constant A of the population risk shrinks, both methods allow less splitting of the 
data, meaning that the sample size per machine must be larger. This limitation is intuitive, since 
lower curvature for the population risk means that the local empirical risks associated with each 
machine will inherit lower curvature as well, and this effect will be exacerbated with a small local 
sample size per machine. Averaging methods are, of course, not a panacea: the allowed number 
of partitions m does not grow linearly in either case, so blindly increasing the number of machines 
proportionally to the total sample size N will not lead to a useful estimate. 

In practice, an optimal choice of r may not be apparent, which may necessitate cross validation 
or another type of model evaluation. We leave as intriguing open questions whether computing 
multiple subsamples at each machine can yield improved performance or reduce the variance of the 
SAVGM procedure, and whether using estimates based on resampling the data with replacement, as 
opposed to without replacement as considered here, can yield improved performance. 


3.4 Time Complexity 


In practice, the exact empirical minimizers assumed in Theorems 1 and 4 may be unavailable. In- 
stead, we need to use a finite number of iterations of some optimization algorithm in order to obtain 
reasonable approximations to the exact minimizers. In this section, we sketch an argument that 
shows that both the AVGM algorithm and the SAVGM algorithm can use such approximate empir- 
ical minimizers, and as long as the optimization error is sufficiently small, the resulting averaged 
estimate achieves the same order-optimal statistical error. Here we provide the arguments only for 
the AVGM algorithm; the arguments for the SAVGM algorithm are analogous. 

More precisely, suppose that each processor runs a finite number of iterations of some optimiza- 
tion algorithm, thereby obtaining the vector 6; as an approximate minimizer of the objective function 
F; ;. Thus, the vector 6; can be viewed as an approximate form of 8;, and we let 6 = +  , 8; denote 
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the average of these approximate minimizers, which corresponds to the output of the approximate 
AVGM algorithm. With this notation, we have 


3 
z| 


where step (i) follows by triangle inequality and the elementary bound (a+b)? < 2a? + 2b’; step 
(ii) follows by Jensen’s inequality. Consequently, suppose that processor i runs enough iterations to 
obtain an approximate minimizer 0} such that 
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E (||; — ||] = O((mn)-*). (13) 


When this condition holds, the bound (12) shows that the average @ of the approximate minimizers 
shares the same convergence rates provided by Theorem 1. 

But how long does it take to compute an approximate minimizer ©; satisfying condition (13)? 
Assuming processing one sample requires one unit of time, we claim that this computation can be 
performed in time O(nlog(mn)). In particular, the following two-stage strategy, involving a com- 
bination of stochastic gradient descent (see the following subsection for more details) and standard 
gradient descent, has this complexity: 


(1) As shown in the proof of Theorem 1, with high probability, the empirical risk F; is strongly 
convex in a ball Bp (81) of constant radius p > 0 around 6;. Consequently, performing stochas- 
tic gradient descent on F; for O(log?(mn)/p7) iterations yields an approximate minimizer that 
falls within Bp (01) with high probability (e.g., Nemirovski et al., 2009, Proposition 2.1). Note 
that the radius p for local strong convexity is a property of the population risk Fo we use as a 
prior knowledge. 


(2) This initial estimate can be further improved by a few iterations of standard gradient descent. 
Under local strong convexity of the objective function, gradient descent is known to converge 
at a geometric rate (see, e.g., Nocedal and Wright, 2006; Boyd and Vandenberghe, 2004), 
so O(log(1/e)) iterations will reduce the error to order £. In our case, we have € = (mn)~?, 
and since each iteration of standard gradient descent requires O(n) units of time, a total of 
O(nlog(mn)) time units are sufficient to obtain a final estimate 0} satisfying condition (13). 


Overall, we conclude that the speed-up of the AVGM relative to the naive approach of processing 
all N = mn samples on one processor, is at least of order m/log(N). 


3.5 Stochastic Gradient Descent with Averaging 


The previous strategy involved a combination of stochastic gradient descent and standard gradient 
descent. In many settings, it may be appealing to use only a stochastic gradient algorithm, due 
to their ease of their implementation and limited computational requirements. In this section, we 
describe an extension of Theorem 1 to the case in which each machine computes an approximate 
minimizer using only stochastic gradient descent. 

Stochastic gradient algorithms have a lengthy history in statistics, optimization, and machine 
learning (Robbins and Monro, 1951; Polyak and Juditsky, 1992; Nemirovski et al., 2009; Rakhlin 
et al., 2012). Let us begin by briefly reviewing the basic form of stochastic gradient descent (SGD). 
Stochastic gradient descent algorithms iteratively update a parameter vector 6’ over time based on 
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randomly sampled gradient information. Specifically, at iteration t, a sample X; is drawn at random 
from the distribution P (or, in the case of a finite set of data {X1,...,X,}, a sample X; is chosen from 
the data set). The method then performs the following two steps: 


e+} =0' -NV S(0';X,) and O+ = argmin { |0 — 6"? |}. (14) 
oco 


Here n; > 0 is a stepsize, and the first update in (14) is a gradient descent step with respect to the 
random gradient V f(6';X;). The method then projects the intermediate point 6’ +2 back onto the 
constraint set © (if there is a constraint set). The convergence of SGD methods of the form (14) has 
been well-studied, and we refer the reader to the papers by Polyak and Juditsky (1992), Nemirovski 
et al. (2009), and Rakhlin et al. (2012) for deeper investigations. 

To prove convergence of our stochastic gradient-based averaging algorithms, we require the 
following smoothness and strong convexity condition, which is an alternative to the Assumptions 2 
and 3 used previously. 


Assumption 5 (Smoothness and Strong Convexity II) There exists a function L: X —> R+ such 
that 


||| V°£(8;x) — V7 f(6*;x) |||, < L(x) |@- ll for allx Ex, 











and EJL? (X)| < L? < œ. There are finite constants G and H such that 

















E[||Vf(0:X)\|5]< Gt, and EJ [VFX |É] < Ht for each fixed 0 € @. 




















In addition, the population function Fo is à-strongly convex over the space ®, meaning that 
V7Fo(0) > Maxa forall € @. 


Assumption 5 does not require as many moments as does Assumption 3, but it does require each 
moment bound to hold globally, that is, over the entire space ©, rather than only in a neighbourhood 
of the optimal point 0*. Similarly, the necessary curvature—in the form of the lower bound on 
the Hessian matrix V?Fọ—is also required to hold globally, rather than only locally. Nonetheless, 
Assumption 5 holds for many common problems; for instance, it holds for any linear regression 
problem in which the covariates have finite fourth moments and the domain © is compact. 


The averaged stochastic gradient algorithm (SGDAVGM) is based on the following two steps: 


(1) Given some constant c > 1, each machine performs n iterations of stochastic gradient de- 
scent (14) on its local data set of n samples using the stepsize n; = 3, then outputs the 
resulting local parameter 6'. 


(2) The algorithm computes the average 6" = tym Qi. 


The following result characterizes the mean-squared error of this procedure in terms of the constants 





:=4c? and B:= cH] co3/4G3/2 (a@/4LG'/2 4G+HR 
Q:=4c" an := max a e- Da2 1/2 ar p32 
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Theorem 5 Under Assumptions I and 5, the output 0 of the SAVGM algorithm has mean-squared 
error upper bounded as 
| 


Theorem 5 shows that the averaged stochastic gradient descent procedure attains the optimal 
convergence rate O(N!) as a function of the total number of observations N = mn. The constant 
and problem-dependent factors are somewhat worse than those in the earlier results we presented 
in Theorems 1 and 4, but the practical implementability of such a procedure may in some circum- 
stances outweigh those differences. We also note that the second term of order O(n~3/?) may be 
reduced to O(n2-24)/ K) for any k > 4 by assuming the existence of kth moments in Assumption 5; 
we show this in passing after our proof of the theorem in Appendix D. It is not clear whether a 
bootstrap correction is possible for the stochastic-gradient based estimator; such a correction could 
be significant, because the term B? /n3/ 2 arising from the bias in the stochastic gradient estimator 
may be non-trivial. We leave this question to future work. 


2 2 
<20 B 
T Xm n? 

















j" -e*] (15) 


4. Performance on Synthetic Data 


In this section, we report the results of simulation studies comparing the AVGM, SAVGM, and 
SGDAVGM methods, as well as a trivial method using only a fraction of the data available on a 
single machine. For each of our simulated experiments, we use a fixed total number of samples 
N = 100,000, but we vary the number of parallel splits m of the data (and consequently, the local 
data set sizes n = N/m) and the dimensionality d of the problem solved. 

For our experiments, we simulate data from one of three regression models: 


Se (u,x) TE, (16) 
d 

y= (u,x) + P vjxj te, or (17) 
j=l 

y= (u,x) Ir h(x)|el, (18) 





where € ~ N(0,1), and h is a function to be specified. Specifically, the data generation procedure 
is as follows. For each individual simulation, we choose fixed vector u € R? with entries u; dis- 
tributed uniformly in [0,1] (and similarly for v), and we set h(x) = a (x;/2)°. The models (16) 
through (18) provide points on a curve from correctly-specified to grossly mis-specified models, so 
models (17) and (18) help us understand the effects of subsampling in the SAVGM algorithm. (In 
contrast, the standard least-squares estimator is unbiased for model (16).) The noise variable € is 
always chosen as a standard Gaussian variate N (0, 1), independent from sample to sample. 
In our simulation experiments we use the least-squares loss 


F(05(x.9)) = 5 (8.x) =). 











The goal in each experiment is to estimate the vector 0* minimizing Fo(8) := E[f (0; (X,Y ))]. For 
each simulation, we generate N samples according to either the model (16) or (18). For each m € 
{2,4,8, 16, 32,64, 128}, we estimate 0* = arg ming Fo(8) using a parallel method with data split into 
m independent sets of size n = N/m, specifically 
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Figure 1: The error O — 0* ||} versus number of machines, with standard errors across twenty simu- 
lations, for solving least squares with data generated according to the normal model (16). 
The oracle least-squares estimate using all N samples is given by the line “All,” while the 
line “Single” gives the performance of the naive estimator using only n = N/m samples. 


(i) The AVGM method 
(ii) The SAVGM method with several settings of the subsampling ratio r 
(iii) The SGDAVGM method with stepsize n; = d/(10(d+t)), which gave good performance. 


In addition to (i)—(iii), we also estimate 0* with 


(iv) The empirical minimizer of a single split of the data of size n = N/m 


(v) The empirical minimizer on the full data set (the oracle solution). 


4.1 Averaging Methods 


For our first set of experiments, we study the performance of the averaging methods (AVGM and 
SAVGM), showing their scaling as the number of splits of data—the number of machines m—grows 
for fixed N and dimensions d = 20 and d = 200. We use the standard regression model (16) to 
generate the data, and throughout we let @ denote the estimate returned by the method under consid- 
eration (so in the AVGM case, for example, this is the vector ð = 01). The data samples consist of 
pairs (x,y), where x € R? and y € R is the target value. To sample each x vector, we choose five dis- 
tinct indices in {1,...,d} uniformly at random, and the entries of x at those indices are distributed 
as N(0,1). For the model (16), the population optimal vector 6* is u. 

In Figure 1, we plot the error ||6 — 6* ||5 of the inferred parameter vector @ for the true parameters 
6* versus the number of splits m, or equivalently, the number of separate machines available for use. 
We also plot standard errors (across twenty experiments) for each curve. As a baseline in each plot, 
we plot as a red line the squared error ||, — 0* || of the centralized “gold standard,” obtained by 
applying a batch method to all N samples. 

From the plots in Figure 1, we can make a few observations. The AVGM algorithm enjoys 
excellent performance, as predicted by our theoretical results, especially compared to the naive 


3334 


COMMUNICATION-EFFICIENT ALGORITHMS FOR STATISTICAL OPTIMIZATION 













































(= AVGM —All [= AVGM Al 
~-SGD-AVGM All ~-SGD-AVGM —All 
10° 10° 
5 5 
2 10 f 2 10 
[S] oi 
5 5 
N a = 
g 10°; 210° 
3 3 
= = 
10°; 10° 
107 ; i i : l i 10° i ; > : l i 
2 4 8 16 32 64 128 2 4 8 16 32 64 128 
Number m of machines Number m of machines 
(a)d =20 (b) d = 200 


Figure 2: Comparison of AVGM and SGDAVGM methods as in Figure 1 plotted on logarithmic 
scale. The plot shows ||6 — 6*||5 — 6 — 9* l2, where @y is the oracle least-squares esti- 
mator using all N data samples. 


solution using only a fraction 1/m of the data. In particular, if @ is obtained by the batch method, 
then AVGM is almost as good as the full-batch baseline even for m as large as 128, though there is 
some evident degradation in solution quality. The SGDAVGM (stochastic-gradient with averaging) 
solution also yields much higher accuracy than the naive solution, but its performance degrades 
more quickly than the AVGM method’s as m grows. In higher dimensions, both the AVGM and 
SGDAVGM procedures have somewhat worse performance; again, this is not unexpected since in 
high dimensions the strong convexity condition is satisfied with lower probability in local data sets. 

We present a comparison between the AVGM method and the SGDAVGM method with some- 
what more distinguishing power in Figure 2. For these plots, we compute the gap between the 
AVGM mean-squared-error and the unparallel baseline MSE, which is the accuracy lost due to par- 
allelization or distributing the inference procedure across multiple machines. Figure 2 shows that 
the mean-squared error grows polynomially with the number of machines m, which is consistent 
with our theoretical results. From Corollary 3, we expect the AVGM method to suffer (lower-order) 
penalties proportional to m? as m grows, while Theorem 5 suggests the somewhat faster growth 
we see for the SGDAVGM method in Figure 2. Thus, we see that the improved run-time perfor- 
mance of the SGDAVGM method—requiring only a single pass through the data on each machine, 
touching each datum only once—comes at the expense of some loss of accuracy, as measured by 
mean-squared error. 


4.2 Subsampling Correction 


We now turn to developing an understanding of the SAVGM algorithm in comparison to the standard 
average mixture algorithm, developing intuition for the benefits and drawbacks of the method. Be- 
fore describing the results, we remark that for the standard regression model (16), the least-squares 
solution is unbiased for 8*, so we expect subsampled averaging to yield little (if any) improvement. 
The SAVGM method is essentially aimed at correcting the bias of the estimator 1, and de-biasing an 
unbiased estimator only increases its variance. However, for the mis-specified models (17) and (18) 
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Figure 3: The error \|0 — 6*||5 plotted against the number of machines m for the AVGM and SAVGM 
methods, with standard errors across twenty simulations, using the normal regression 
model (16). The oracle estimator is denoted by the line “AIL.” 
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Figure 4: The error ||6 — 6* || plotted against the number of machines m for the AVGM and SAVGM 
methods, with standard errors across twenty simulations, using the non-normal regression 
model (18). The oracle estimator is denoted by the line “AIL.” 


we expect to see some performance gains. In our experiments, we use multiple sub-sampling rates 
to study their effects, choosing r € {0.005,0.01,0.02,0.04}, where we recall that the output of the 
SAVGM algorithm is the vector @ := (8; — r82) /(1—r). 

We begin with experiments in which the data is generated as in the previous section. That is, to 
generate a feature vector x €“, choose five distinct indices in {1,...,d} uniformly at random, and the 
entries of x at those indices are distributed as N (0,1). In Figure 3, we plot the results of simulations 
comparing AVGM and SAVGM with data generated from the normal regression model (16). Both 
algorithms have have low error rates, but the AVGM method is slightly better than the SAVGM 
method for both values of the dimension d and all and sub-sampling rates r. As expected, in this 
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Figure 5: The error O — 6*||5 plotted against the number of machines m for the AVGM and SAVGM 
methods using regression model (17). 


case the SAVGM method does not offer improvement over AVGM, since the estimators are unbiased. 
(In Figure 3(a), we note that the standard error is in fact very small, since the mean-squared error is 
only of order 107°.) 


To understand settings in which subsampling for bias correction helps, in Figure 4, we plot 
mean-square error curves for the least-squares regression problem when the vector y is sampled 
according to the non-normal regression model (18). In this case, the least-squares estimator is 
biased for 6* (which, as before, we estimate by solving a larger regression problem using 10N data 
samples). Figure 4 shows that both the AVGM and SAVGM method still enjoy good performance; 
in some cases, the SAVGM method even beats the oracle least-squares estimator for 6* that uses 
all N samples. Since the AVGM estimate is biased in this case, its error curve increases roughly 
quadratically with m, which agrees with our theoretical predictions in Theorem 1. In contrast, we see 
that the SAVGM algorithm enjoys somewhat more stable performance, with increasing benefit as the 
number of machines m increases. For example, in case of d = 200, if we choose r = 0.01 for m < 32, 
choose r = 0.02 for m = 64 and r = 0.04 for m = 128, then SAVGM has performance comparable 
with the oracle method that uses all N samples. Moreover, we see that all the values of r—at least 
for the reasonably small values we use in the experiment—provide performance improvements over 
a non-subsampled distributed estimator. 


For our final simulation, we plot results comparing SAVGM with AVGM in model (17), which is 
mis-specified but still a normal model. We use a simpler data generating mechanism, specifically, 
we draw x ~ N (0, I4xa) from a standard d-dimensional normal, and v is chosen uniformly in {0, 1]; 
in this case, the population minimizer has the closed form 6* = u + 3v. Figure 5 shows the results 
for dimensions d = 20 and d = 40 performed over 100 experiments (the standard errors are too 
small to see). Since the model (17) is not that badly mis-specified, the performance of the SAVGM 
method improves upon that of the AVGM method only for relatively large values of m, however, the 
performance of the SAVGM is always at least as good as that of AVGM. 
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Feature Name | Dimension | Description 

Query 20000 | Word tokens appearing in the query. 

Gender 3 | Gender of the user 

Keyword 20000 | Word tokens appearing in the purchase keywords. 
Title 20000 | Word tokens appearing in the ad title. 
Advertiser 39191 | Advertiser’s ID 

AdID 641707 | Advertisement’s ID. 

Age 6 | Age of the user 

UserFreq 25 | Number of appearances of the same user. 
Position 3 | Position of advertisement on search page. 
Depth 3 | Number of ads in the session. 

QueryFreq 25 | Number of occurrences of the same query. 
AdFreq 25 | Number of occurrences of the same ad. 
QueryLength 20 | Number of words in the query. 

TitleLength 30 | Number of words in the ad title. 

DespLength 50 | Number of words in the ad description. 
QueryCtr 150 | Average click-through-rate for query. 
UserCtr 150 | Average click-through-rate for user. 

AdvrCtr 150 | Average click-through-rate for advertiser. 
WordCtr 150 | Average click-through-rate for keyword advertised. 
UserAdFreq 20 | Number of times this user sees an ad. 
UserQueryFreq 20 | Number of times this user performs a search. 





Table 1: Features used in online advertisement prediction problem. 


5. Experiments with Advertising Data 


Predicting whether a user of a search engine will click on an advertisement presented to him or 
her is of central importance to the business of several internet companies, and in this section, we 
present experiments studying the performance of the AVGM and SAVGM methods for this task. We 
use a large data set from the Tencent search engine, soso. com (Sun, 2012), which contains 641,707 
distinct advertisement items with N = 235,582,879 data samples. 

Each sample consists of a so-called impression, which in the terminology of the information 
retrieval literature (e.g., see the book by Manning et al., 2008), is a list containing a user-issued 
search, the advertisement presented to the user in response to the search, and a label y € {+1,—1} 
indicating whether the user clicked on the advertisement. The ads in our data set were presented to 
23,669,283 distinct users. 

Transforming an impression into a useable set of regressors x is non-trivial, but the Tencent data 
set provides a standard encoding. We list the features present in the data in Table 1, along with some 
description of their meaning. Each text-based feature—that is, those made up of words, which are 
Query, Keyword, and Title—is given a “bag-of-words” encoding (Manning et al., 2008). This en- 
coding assigns each of 20,000 possible words an index, and if the word appears in the query (or Key- 
word or Title feature), the corresponding index in the vector x is set to 1. Words that do not appear 
are encoded with a zero. Real-valued features, corresponding to the bottom fifteen features in Ta- 
ble 1 beginning with “Age”, are binned into a fixed number of intervals [—°, a1], (a1,a2],..-, (ax,~], 
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Figure 6: The negative log-likelihood of the output of the AVGM, SAVGM, and stochastic methods 
on the held-out data set for the click-through prediction task. (a) Performance of the 
AVGM and SAVGM methods versus the number of splits m of the data. (b) Performance 
of SDCA and SGD baselines as a function of number of passes through the entire data 
set. 


each of which is assigned an index in x. (Note that the intervals and number thereof vary per feature, 
and the dimension of the features listed in Table 1 corresponds to the number of intervals). When a 
feature falls into a particular bin, the corresponding entry of x is assigned a 1, and otherwise the en- 
tries of x corresponding to the feature are 0. Each feature has one additional value for “unknown.” 
The remaining categorical features—gender, advertiser, and advertisement ID (AdID)—are also 
given {0,1} encodings, where only one index of x corresponding to the feature may be non-zero 
(which indicates the particular gender, advertiser, or AdID). This combination of encodings yields 
a binary-valued covariate vector x € {0,1}¢ with d = 741,725 dimensions. Note also that the fea- 
tures incorporate information about the user, advertisement, and query issued, encoding information 
about their interactions into the model. 

Our goal is to predict the probability of a user clicking a given advertisement as a function of 
the covariates in Table 1. To do so, we use a logistic regression model to estimate the probability of 
a click response 





1 
P(y=1|x;0):= j 
05159 = Tex Oa) 
where 0 € R? is the unknown regression vector. We use the negative logarithm of P as the loss, 
incorporating a ridge regularization penalty. This combination yields instantaneous loss 


£(@: (v.y)) =log(1-+ exp(—y (8,x))) + 013. 


In all our experiments, we assume that the population negative log-likelihood risk has local strong 
convexity as suggested by Assumption 2. In practice, we use a small regularization parameter 
A = 10-° to ensure fast convergence for the local sub-problems. 

For this problem, we cannot evaluate the mean-squared error ||@ — 6*||5, as we do not know 
the true optimal parameter 8*. Consequently, we evaluate the performance of an estimate f) using 


3339 


ZHANG, DUCHI AND WAINWRIGHT 


log-loss on a held-out data set. Specifically, we perform a five-fold validation experiment, where 
we shuffle the data and partition it into five equal-sized subsets. For each of our five experiments, 
we hold out one partition to use as the test set, using the remaining data as the training set for 
inference. When studying the AVGM or SAVGM method, we compute the local estimate 0; via a 
trust-region Newton-based method (Nocedal and Wright, 2006) implemented by LIBSVM (Chang 
and Lin, 2011). 

The data set is too large to fit in the memory of most computers: in total, four splits of the 
data require 55 gigabytes. Consequently, it is difficult to provide an oracle training comparison 
using the full N samples. Instead, for each experiment, we perform 10 passes of stochastic dual 
coordinate ascent (SDCA) (Shalev-Shwartz and Zhang, 2012) and 10 passes of stochastic gradient 
descent (SGD) through the data set to get two rough baselines of the performance attained by the 
empirical minimizer for the entire training data set. Figure 6(b) shows the hold-out set log-loss 
after each of the sequential passes through the training data finishes. Note that although the SDCA 
enjoys faster convergence rate on the regularized empirical risk (Shalev-Shwartz and Zhang, 2012), 
the plot shows that the SGD has better generalization performance. 

In Figure 6(a), we show the average hold-out set log-loss (with standard errors) of the estimator 
©; provided by the AVGM method versus number of splits of the data m, and we also plot the log-loss 
of the SAVGM method using subsampling ratios of r € {.1,.25}. The plot shows that for small m, 
both AVGM and SAVGM enjoy good performance, comparable to or better than (our proxy for) the 
oracle solution using all N samples. As the number of machines m grows, however, the de-biasing 
provided by the subsampled bootstrap method yields substantial improvements over the standard 
AVGM method. In addition, even with m = 128 splits of the data set, the SAVGM method gives 
better hold-out set performance than performing two passes of stochastic gradient on the entire data 
set of m samples; with m = 64, SAVGM enjoys performance as strong as looping through the data 
four times with stochastic gradient descent. This is striking, since doing even one pass through 
the data with stochastic gradient descent gives minimax optimal convergence rates (Polyak and 
Juditsky, 1992; Agarwal et al., 2012). In ranking applications, rather than measuring negative log- 
likelihood, one may wish to use a direct measure of prediction error; to that end, Figure 7 shows 
plots of the area-under-the-curve (AUC) measure for the AVGM and SAVGM methods; AUC is a 
well-known measure of prediction error for bipartite ranking (Manning et al., 2008). Broadly, this 
plot shows a similar story to that in Figure 6. 

It is instructive and important to understand the sensitivity of the SAVGM method to the value 
of the resampling parameter r. We explore this question in Figure 8 using m = 128 splits, where 
we plot the log-loss of the SAVGM estimator on the held-out data set versus the subsampling ratio 
r. We choose m = 128 because more data splits provide more variable performance in r. For the 
soso.com ad prediction data set, the choice r = .25 achieves the best performance, but Figure 8 
suggests that mis-specifying the ratio is not terribly detrimental. Indeed, while the performance 
of SAVGM degrades to that of the AVGM method, a wide range of settings of r give improved 
performance, and there does not appear to be a phase transition to poor performance. 


6. Discussion 


Large scale statistical inference problems are challenging, and the difficulty of solving them will 
only grow as data becomes more abundant: the amount of data we collect is growing much faster 
than the speed or storage capabilities of our computers. Our AVGM, SAVGM, and SGDAVGM meth- 
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Figure 7: The area-under-the-curve (AUC) measure of ranking error for the output of the AVGM 
and SAVGM methods for the click-through prediction task. 
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Figure 8: The log-loss on held-out data for the SAVGM method applied with m = 128 parallel splits 
of the data, plotted versus the sub-sampling rate r. 


ods provide strategies for efficiently solving such large-scale risk minimization problems, enjoying 
performance comparable to an oracle method that is able to access the entire large data set. We 
believe there are several interesting questions that remain open after this work. First, nonparametric 
estimation problems, which often suffer superlinear scaling in the size of the data, may provide an 
interesting avenue for further study of decomposition-based methods. Our own recent work has 
addressed aspects of this challenge in the context of kernel methods for non-parametric regression 
(Zhang et al., 2013). More generally, an understanding of the interplay between statistical efficiency 
and communication could provide an avenue for further research, and it may also be interesting to 
study the effects of subsampled or bootstrap-based estimators in other distributed environments. 
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Appendix A. The Necessity of Smoothness 


Here we show that some version of the smoothness conditions presented in Assumption 3 are nec- 
essary for averaging methods to attain better mean-squared error than using only the n samples on 
a single processor. Given the loss function (5), let no = Yj_; 1(x,=0) to be the count of 0 samples. 
Using 8; as shorthand for 01 ;, we see by inspection that the empirical minimizer 9; is 


fe when no < n/2 


— — otherwise. 
no 


For simplicity, we may assume that n is odd. In this case, we obtain that 












































z 1 no _| a 

[61] = gre [gosam] -E leon) 
1,188 mi 1 5 myn 1 1f i n 
40 a NSn 2m a N A 28 Ee Mi] [a ai) 


by the symmetry of the binomial. Adding and subtracting 5 from the term within the braces, noting 
that P(no < n/2) = 1/2, we have the equality 


soled (i-a & (SS. 


i=0 

















If Z is distributed normally with mean 1/2 and variance 1/(4n), then an asymptotic expansion of 
the binomial distribution yields 


(5) > C) an =E A) |O<Z< ;| +0(n-'/2) 


i=0 





























1 1 
> 5E [z—22" |O<Z< J +o(n—'/?) = Q(n-?), 











the final equality following from standard calculations, since E[|Z|] = Q(n-!/). 
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Appendix B. Proof of Theorem 1 


Although Theorem 1 is in terms of bounds on 8" order moments, we prove a somewhat more 
general result in terms of a set of (ko,ki,k2) moment conditions given by 




















E(IVF@X)IP]1<G%, —— EIlV7F(6:X) - V?F(8)|||7] < H”, 
EJL(X)P] <L and E[(L(X)—E[L(X)})®] < L® 












































for 8 € U. (Recall the definition of U prior to Assumption 3). Doing so allows sharper control if 
higher moment bounds are available. The reader should recall throughout our arguments that we 
have assumed min{ko,k,,k2} > 8. Throughout the proof, we use F; and 6; to indicate the local 
empirical objective and empirical minimizer of machine 1 (which have the same distribution as 
those of the other processors), and we recall the notation 1(¢) for the indicator function of the event 
E. 

Before beginning the proof of Theorem 1 proper, we begin with a simple inequality that relates 
the error term 6 — 6* to an average of the errors 0; — 0*, each of which we can bound in turn. 
Specifically, a bit of algebra gives us that 




















































































































2 
— 2 1 
E(||0 —0* =el 0; — 0" | 
[21-2 |[Z¥o.-er], 
-L YF EI/0;-0 q++¥ E[(0;—0*,0; —0*)] 
m j m oes 
1 n *]]2 m(m— 1) n *]]]2 
< -EJO -0 2] + 7 Ejo -o 
1 * * 
< —E|||0: — 0* ||3] + ||E[@: — 6°] |]. (19) 


Here we used the definition of the averaged vector 6 and the fact that for i Æ j, the vectors 0; and 0 j 
are statistically independent, they are functions of independent samples. The upper bound (19) illu- 
minates the path for the remainder of our proof: we bound each of E{||@;— 0*||Ż] and ||E[6; — 6*]||5. 
Intuitively, since our objective is locally strongly convex by Assumption 2, the empirical minimiz- 
ing vector 8, is a nearly unbiased estimator for 0*, which allows us to prove the convergence rates 
in the theorem. 


























We begin by defining three events—which we (later) show hold with high probability—that 
guarantee the closeness of 8; and 6*. In rough terms, when these events hold, the function Fy 
behaves similarly to the population risk Fo around the point 6*; since Fo is locally strongly convex, 
the minimizer 0; of F; will be close to 8*. Recall that Assumption 3 guarantees the existence of a 
ball Up = {0 € R° : ||0 — 6*||, < p} of radius p € (0,1) such that 


IIV°F(@:x) -VAE < Lle- e'l 














for all 6,0’ € Up and any x, where E[L(X)®] < L}. In addition, Assumption 2 guarantees that 
V? Fo (0*) = ÀI. Now, choosing the potentially smaller radius 55 = min{p,pA/4L}, we can define 
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the three “good” events 


Eo := {rE EK) <2}, 
i=1 
2p (Q* 2p (9% pÀ 
E := fiv Fi (8*) — V*Fo(6*) |||, < mt. and (20) 
f= SVR O < Se} 


We then have the following lemma: 

Lemma 6 Under the events Eo, E1, and £ previously defined (20), we have 

2\|VFi(8") Il, 
CE 

The proof of Lemma 6 relies on some standard optimization guarantees relating gradients to mini- 

mizers of functions (e.g., Boyd and Vandenberghe, 2004, Chapter 9), although some care is required 

since smoothness and strong convexity hold only locally in our problem. As the argument is some- 


what technical, we defer it to Appendix E. 
Our approach from here is to give bounds on E|||@, — 0*||Ż] and ||E[@, — 9*]||; by careful Taylor 


l0: —9* ||, < and V°F,(®) > (1—p)Maxa- 



































expansions, which allows us to bound E(||61 = e*l] via our initial expansion (19). We begin by 
noting that whenever the events £o, £1, and £, hold, then VF; (01) = 0, and moreover, by a Taylor 
series expansion of VF; between 0* and 01, we have 


0 = VF (01) = VF,(0*) + VF, (8’) (0; — 0*) 





where 6’ = K0* + (1 — «)0; for some «x € [0,1]. By adding and subtracting terms, we have 


0 = VF; (6*) + (V?F; (0") — V° Fi (8*)) (8: — 0*) 
+ (V°F; (0*) — V° Fo(0*)) (01 — 0*) + V? Fo(0*) (01 —0*). (21) 


Since V?F(0*) = AI, we can define the inverse Hessian matrix £~! := [V*Fo(6*)|~', and setting 
A := 0; — 6*, we multiply both sides of the Taylor expansion (21) by =~! to obtain the relation 


A= —7'VF(0*) +27! (V?F (0*) — VF, (0'))A +E (V*Fo(0*) — VF, (0*))A. 22) 


Thus, if we define the matrices P = V7Fo(0*) — V7F,(0*) and Q = V7F,(0*) — V7F,(0’), equal- 
ity (22) can be re-written as 


6, —6* = -X İVA (0*) += |'(P+Q)(0; — 8"). (23) 


Note that Equation (23) holds when the conditions of Lemma 6 hold, and otherwise we may simply 
assert only that ||6; — 6*||, < R. Roughly, we expect the final two terms in the error expansion (23) 
to be of smaller order than the first term, since we hope that 8; — 8* — 0 and additionally that the 
Hessian differences decrease to zero at a sufficiently fast rate. We now formalize this intuition. 

Inspecting the Taylor expansion (23), we see that there are several terms of a form similar to 
(V7Fo(0*) — VF; (6*)) (0; — 6*); using the smoothness Assumption 3, we can convert these terms 
into higher order terms involving only 0; — 8*. Thus, to effectively control the expansions (22) 
and (23), we must show that higher order terms of the form E|||@; — e*l], for k > 2, decrease 
quickly enough in n. 
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B.0.1 CONTROL OF E[||@; — 6*||5] 


Recalling the events (20), we define £ := £o N E N & and then observe that 









































E (|. — 0* [15] = E[1 (æ) [161 — 8°15] + E[1 (ee |] 61 — 0*1] 
XŽE[1 (s) IVF (*)II5 

(1—p)*Ak 

e 2KE([||VFi (8*) 3] 

T (1—p) Ak 











< 





+ P(E) RE 














+P(E‘)R, 





where we have used the bound ||6 — 6*||, < R for all 0 € ©, from Assumption 1. Our goal is to 
prove that E[||VFi(8*) ||] = O(n-*/?) and that P(£°) = O(n-"/7). We move forward with a two 
lemmas that lay the groundwork for proving these two facts: 














Lemma 7 Under Assumption 3, there exist constants C and C' (dependent only on the moments ko 
and kı respectively) such that 


























[|| VF (8) || Zee. d 24 
IVF II") < Co, an (24) 
2 : eink log*'/? (2d)H™ 
EVA @*) — V°Fo(8") ||] <-> — (25) 








See Appendix F for the proof of this claim. 

As an immediate consequence of Lemma 7, we see that the events £ and £, defined by (20) 
occur with reasonably high probability. Indeed, recalling that £ = £o N £; N £, Boole’s law and 
the union bound imply 





















































P(E*) = P(£) U ET U Ey) 
< PCES) + P( Ey) + P(Ey) 
D n } * «\ |] â n 
EEL LX) -ELONE EIJ- VAO] ZENA] 
< Lk l pki AKI l a _ p) Sko 
1 yi lpg (2d)H" asia 
< C2 nk2/2 l! 1 nki/2 T Co nko/2 (26) 


for some universal constants Co, C1, C2, where in the second-to-last line we have invoked the moment 
bound in Assumption 3. Consequently, we find that 


P(E°)R* = O(Rt(n™/ +n 4 n—/?) for any k €N. 
In summary, we have proved the following lemma: 


Lemma 8 Let Assumptions 2 and 3 hold. For any k € N with k < min{kọ,kı,k2}, we have 














: F ` G = = 7 z 
6s -01 = O (12. St antl) = 0 (a), 


where the order statements hold as n — +o. 
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Now recall the matrix Q = V7F,(6*) — V7F,(6') defined following Equation (22). The following 
result controls the moments of its operator norm: 





Lemma 9 For k < min{k2,ki,ko}/2, we have E||||Q|||$] = O(n~*/?). 











Proof We begin by using Jensen’s inequality and ane 3 to see that 
lla’ < > dlls (0x xi)" <5 ELX l-el. 


Now we apply the Cauchy-Schwarz inequality and Lemma 8, thereby obtaining 


. beml{ir ria : 2k]? p œŒ —k/2 
z[k] < |É») [i161 -6 H =o(1 Upa" , 


where we have used Assumption 3 again. a 


1 
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Lemma 8 allows us to control the first term from our initial bound (19) almost immediately. 
Indeed, using our last Taylor expansion (23) and the definition of the event £ = £o N £, N En, we 
have 












































zlo: 0*1] = E [1 || 2! VA") +271 (P+ 0)(01 — 8°) [3] + Elles 110: - 0°13 


E[||2-1VAi(6")||3] +2E T 
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\(@: —6*)|| 3| +P ER? 


where we have applied the inequality (a+b)? < 2a? +2b?. Again using this same inequality, then 
applying Cauchy-Schwarz and Lemmas 8 and 9, we see that 


E||=-'(P+.9)(6 —8")|[3] < 2e k 


< 21/1 (vel 
= O(n”), 


where we have used the fact that min{ko,k1,k2} > 8 to apply Lemma 9. Combining these results, 
we obtain the upper bound 
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E[||61 — 8*1] < 2E [EVF 0] + OW), (27) 














which completes the first part of our proof of Theorem 1. 











B.0.2 CONTROL OF ||E[0; — 6*]||5 














It remains to consider the ||E[6,—*]||5 term from our initial error inequality (19). When the 


events (20) occur, we know that all derivatives exist, so we may recursively apply our expansion (23) 
of 6, — 0* to find that 


0, — 0* = -E 'V F (0*) +27! (P +0)(01 — 9") 
= E! VF, (0*) +27! (P +Q) [-27'VF (0*) +£ (P +Q)(0: — 6*)] (28) 
saa oeaunaaaaaaaaaaaaasasasasasasasasasasasasassasasasasiIiIiI 


=v 
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where we have introduced v as shorthand for the vector on the right hand side. Thus, with a bit of 
algebraic manipulation we obtain the relation 


6; — 0* = l(gyv+ lc) (81 —0*) =v+ Lee) (81 — 6*) = l(geyv =v+ lec) (81 — 6* —v). (29) 





Now note that E[VF; (0*)] = 0 thus 























Elv] = E [—D7' VF, (6*) +27! (P+ Q)[-27' VF, (6*) +27! (P+) (0; — &)]] 
=E [El (P +Q)£~' [(P + Q) (8; —6*) — VF; (0*)]] . 

















Thus, by re-substituting the appropriate quantities in (29) and applying the triangle inequality, we 
have 










































































||E[6; — 6*]llz 
< ||E[E (P + Q)£™' ((P +Q)(01 — 0*) — VFi(8*))]]|, + [EU (e) (@1 — 8* — v)]||, 
< |E! (P + Q)E7! ((P + Q) (61 — 0*) — VF (0*))]||, + E[l (ee) |181 — O* ||] 
+E [1ee) |-27' VF (80*) +E" (P +0)£~' [-VF; (8*) + (P +0)(0: -6)]l],]- 6O 








Since ||6; — 0* ||, < R by assumption, we have 





© O(Rn™?) 








E[l (ec) ||61 — 0” ||,] < P(E)R 





for any k < min{k2,ki,ko}, where step (i) follows from the inequality (26). Hélder’s inequality also 
yields that 
































m'(P+Q)2°'VF,(0*)||,] < E [e |Z (P+ oll EVA e] 
1/4 1/4 
< VP(E°)E fe" P+O] E(N] 
[lE (P+Q) I= O(log?(d)n~7), and we similarly have 


R[|| E- VF (6*)||3] = O(n~*). Lastly, we have P(£°) = O(n-*/”) for k < min{ko,k1,k2}, whence 
we find that for any such k, 
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E [1 (ee) 



































Recalling Lemmas 7 and 9, we have E 
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5-'(P+Q)5-1VF,(6")|),] = o( log(d)n */*"') l 


We can similarly apply Lemma 8 to the last remaining term in the inequality (30) to obtain that for 
any k < min{k2,kı, ko}, 




















-5-!VF,(6") +£-1(P +0) [-27'VF; (0) +27! (P +0) (01 -6")] |, ] 
= O(n ni), 


E [1 (ee) 





Applying these two bounds, we find that 




















Elo: — 8*]||, < ||E [E7 (P + 0)E~' ((P+.Q)(0; -0) -VA (0|, +0) 6D 








for any k such that k < min{ko,kı,k2}/2 and k < min{ko,ki,k2}/4+1. 
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In the remainder of the proof, we show that part of the bound (31) still consists only of higher- 
order terms, leaving us with an expression not involving 8; — 6*. To that end, note that 


| 


by three applications of Hélder’s inequality, the fact that ||Ax||, < |All], ||x||,, and Lemmas 7, 8 
and 9. Coupled with our bound (31), we use the fact that (a+b)? < 2a” +2b? to obtain 
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JE (P+Q)E"'(P-+0)(61 —6")||3] = 














zo —6*]|[3 <2| 1V F (0*)]||; + O(n). 








(P+ Q)E- 


EJE (32) 








We focus on bounding the remaining expectation. We have the following series of inequalities: 
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Here step (i) follows from Jensen’s inequality and the fact that ||Ax||, < |||A|ll, ||x|],; step Gi) uses the 
Cauchy-Schwarz inequality; and step (iii) follows from the fact that (a+b)? < 2a? + 2b”. We have 
already bounded the first two terms in the product in our proofs; in particular, Lemma 7 guarantees 
that E||||Pl|5] < CH logd /n, while 
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E È Ea i 


for some numerical constant C (recall Lemma 9). Summarizing our bounds on |||P|||, and |||Q|||,, we 
have 
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O(n 3) 


From Assumption 3 we know that E[||VF;(6*)||5] < G? /n and IEI < 1/À, and hence we can 
further simplify the bound (33) to obtain 


í woe C [i loedt Ve (Mp) \ 
Efo; -0 IBS i ( - e| 
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'VF(6")||3] + O(n) 














r [ETVE] + 00) 





n2 


for some numerical constant C, where we have applied our earlier inequality (32). Noting that we 
may (without loss of generality) take p < $, then applying this inequality with the bound (27) on 


E(|/®1 











— 0* 13] we previously proved to our decomposition (19) completes the proof. 
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Appendix C. Proof of Theorem 4 


Our proof of Theorem 4 begins with a simple inequality that mimics our first inequality (19) in 
the proof of Theorem 1. Recall the definitions of the averaged vector 6; and subsampled averaged 
vector 05. Let 0; denote the minimizer of the (an arbitrary) empirical risk F;, and @2 denote the 
minimizer of the resampled empirical risk F) (from the same samples as 01). Then we have 


2 2 2 
= 1 
E < Ei e] +—E | - 34) 
2 2 m 


l-r 2 
Thus, parallel to our proof of Theorem 1, it suffices to bound the two terms in the decomposition (34) 
separately. Specifically, we prove the following two lemmas. 











0; — 105 7 


l-r 


6, — r02 


l-r 








0* 














0* 












































Lemma 10 Under the conditions of Theorem 4, 
0; — 70 A 1 MG GP 1 
| E | - — -o < O(1) ( + diogd ) ae (35) 


7 r(l—r)? \ 26 A4 3 
Lemma 11 Under the conditions of Theorem 4, 





















































z [l0 -0*-r(82- È| < 2+3r)E [|V VR] +0 80 


In conjunction, Lemmas 10 and 11 coupled with the decomposition (34) yield the desired claim. 
Indeed, applying each of the lemmas to the decomposition (34), we see that 


5 |" —r05 _9* 



































<an” 


1 1 
O =f. 29 O -3 
t (a + ae ), 
which is the statement of Theorem 4. 


The remainder of our argument is devoted to establishing Lemmas 10 and 11. Before providing 
their proofs (in Appendices C.3 and C.4 respectively), we require some further set-up and auxiliary 
results. Throughout the rest of the proof, we use the notation 





=e |[V2F(0*) "VE (G 


g 243r 
2 


Y=Y'+R, 


for some random variables Y and Y’ to mean that there exists a random variable Z such that Y = 
Y' +Z and E[||Z||3] = O(n~*).! The symbol Ry may indicate different random variables throughout 
a proof and is notational shorthand for a moment-based big-O notation. We also remark that if we 
have E|||Z||5] = O(a‘n~*), we have Z = a‘/? Ry, since (a‘/?)? =a‘. For shorthand, we also say that 
EJZ] = O(h(n)) if ||E[Z]||, = O(h(n)), which implies that if Z = R; then E[Z] = O(n-*/?), since 


IEIZ] < V ElliZll3] = O(n"). 


1. Formally, in our proof this will mean that there exist random vectors Y, Y’, and Z that are measurable with respect to 
the 6-field 6(X;,...,X»), where Y = Y’ +Z and E[||Z||3] = O(n~*). 
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C.1 Optimization Error Expansion 


In this section, we derive a sharper asymptotic expansion of the optimization errors 8; — 0*. Recall 
our definition of the Kronecker product &, where for vectors u,v we have u@v = uv'. With this 
notation, we have the following expansion of 6; — 0*. In these lemmas, &3 denotes a vector Z for 
which E|||Z I2] < cn”? for a numerical constant c. 














Lemma 12 Under the conditions of Theorem 4, we have 
0; —0* = -EIVA (0*) +27! (V°F (0*) — E)E !V F, (0*) (37) 
-EI V? R (0*) ((27!'VFi(6*)) @ (2! VF (6*))) 
+ (M°G°/0° + GAL? dlog(d)/A*) R. 
We prove Lemma 12 in Appendix G. The lemma requires careful moment control over the expan- 
sion 0; — 0*, leading to some technical difficulty, but is similar in spirit to the results leading to 
Theorem 1. 


An immediately analogous result to Lemma 12 follows for our sub-sampled estimators. Since 
we use [rn] samples to compute 02, the second level estimator, we find 


Lemma 13 Under the conditions of Theorem 4, we have 
02 — 0* = -X !V P (0*) +271(V7F,(0*) — E)E! VF (0*) 
-E IVR (0*) ((E71VP(0*)) 9 (Z'VA(6*))) 
+r? (M2G9/28 + G4L2dlog(d)/M) Ro. 


C.2 Bias Correction 


Now that we have given Taylor expansions that describe the behaviour of 8; — 0* and 0, — 0*, 
we can prove Lemmas 10 and 11 (though, as noted earlier, we defer the proof of Lemma 11 to 
Appendix C.4). The key insight is that expectations of terms involving VF, (0*) are nearly the same 
as expectations of terms involving VF (6*), except that some corrections for the sampling ratio r 
are necessary. 
We begin by noting that 
6, — r02 o= 6, — 0* E a (38) 


l-r l-r l-r 





In Lemmas 12 and 13, we derived expansions for each of the right hand side terms, and since 


E[2~'VF,(0*)| =0 and EJE !VF(0*)]=0, 


























Lemmas 12 and 13 coupled with the rewritten correction (38) yield 

E[0; — 0* — r(02 — 0*)] = —rEJE | (V? F (0*) — E)E- |! VF (0*)] 

+E[E !(V?F (0*) — £)E ~! VF; (9*)] 

+rE[E V? F (0*) (£7 'VR(6*)) @ (271V P (0*)))] 

— EIE V? m (0*) ((27' VF (0*)) @ (27! VF, (0*)))] 

+ O(1)r-'/? (M?G°/a° + GL’ dlog(d) /A®) n°.. (39) 






































s 

















Here the remainder terms follow because of the r~>/ 2R, term on 0, — 0*. 
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C.3 Proof of Lemma 10 


To prove the claim in the lemma, it suffices to show that 


























TEJE! (V? F (0*) — E)E! VF (0*)] = EJE! (V° Fi (0*) —L)E7'VF\(0*)] (40) 


and 











s 





EIE IV? F (0*) (E7 'VR(0*)) @ (£7'VF(0*)))] 
= E[E V? R (0*) ((£7'VF,(8*)) @ (£7'VF (6")))] (41) 














Indeed, these two claims combined with the expansion (39) yield the bound (35) in Lemma 10 
immediately. 

We first consider the difference (40). To barge things notationally simpler, we define functions 
A: X >R? and B: X — Rf via A(x) := 27!(V7f(0*;x) — £) and B(x) := E7!V f (0*;x). If we 
let Sı = {Xj,...,X,} be the original samples and S2 = {Y,...,¥;n} be the subsampled data set, we 


must show 
aE awe) = [pawe], 
LJ ij 


Since the Y; are sampled without replacement (i.e., from P directly), and EJA (X;)] = 0 and E[B(X;)] = 
0, we find that E[A(Y;)B(Y;)] = 0 for i ¢ j, and thus 






























































E EAB) = È EATE) = mEA TBC). 
ij i= 
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cape E EAB] = Z [A(Y1)B(Y1)] Tea) X] 
ij 
= LEEM) 


The statement (41) follows from analogous arguments. 
C.4 Proof of Lemma 11 
The proof of Lemma 11 follows from that of Lemmas 12 and 13. We first claim that 
6, —0* = —E VF (0*)+R and 0- 0* = -EVP (0) +r iR. (42) 


The proofs of both claims similar, so we focus on proving the second statement. Using the inequality 
(a+b+c) <3(a° +b? +c’) and Lemma 13, we see that 


5 
z| 























6° +E°'VA(6")||}| < 3E [|E (VF (6) -E)E VE 
+3E| -1V3 (0*) ((Z7!VA(8")) @ (27! V0") DI 
+3r O(n’). (43) 


























3351 


ZHANG, DUCHI AND WAINWRIGHT 


We now bound the first two terms in inequality (43). Applying the Cauchy-Schwarz inequality and 
Lemma 7, the first term can be upper bounded as 














z |27 (V2F,(6*) — E)E! VF (0 A 


<(E| vani] 
= (r-?) Ollog(d)n?) -r20 = r.0(n-), 
































'(v7FA(6*) -o)l E| 


where the order notation subsumes the logarithmic factor in the dimension. Since V3 (0*) : R? + 
IR¢ is linear, the second term in the inequality (43) may be bounded completely analogously as 
it involves the outer product ©~!'VF;(6*) @L£~!VF;(6*). Recalling the bound (43), we have thus 
shown that 














E [|02 -9* +2-'VAA(6")||3] =r 200), 


or 0) — 0* = -E !V F (0*) +r -!R. The proof of the first equality in Equation (42) is entirely 
analogous. 
We now apply the equalities (42) to obtain the result of the lemma. We have 


























z fie: — 8* — r(@2 -01] = z |||- >-!VF,(0*) + r="! VF (0*) +l] 
Using the inequality (a+b)? < (1+n)a?+(1+1/n)b* for any n > 0, we have 


(l+n)a*+(1+1/n)(b+c)? 
(1+n)a?+(14+1/n)(1+a)b? + (14+1/n)(1+1/a)c? 


(a+b+c)* 








= 
< 


for any n, > 0. Taking n = 1 and a = 1/2, we obtain (a +b +c)? < 2a? + 3b* + 6c’, so applying 
the triangle inequality, we have 


























E, Ile: — 8* -r02 -0)1] = z |||- EIVA (0*)+r2-'VAR(0*) )+ olla] (44) 


£ 
z| 


Since F} is a sub-sampled version of F}, algebraic manipulations yield 


s 
e| 


Combining equations (44) and (45), we obtain the desired bound (36). 








< 


N 























'VFi(8")|[3] +3°E | 





1YE (0*) I + O(n-). 






































'VFi(0")|3| = Z z [|z vr (0f). (45) 





'vRO»I] =—E| 





Appendix D. Proof of Theorem 5 


We begin by recalling that if 6” denotes the output of performing stochastic gradient on one machine, 
then from the inequality (19) we have the upper bound 




















* 2 1 ni n * p n * 
- 0*||;] < —E[||6" — 6° l3] + ||E[6" — 6*]||3- 














E[||0" 
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To prove the error bound (15), it thus suffices to prove the inequalities 


























, aor 

zlo" —8"]3] < S5, and (46) 
2- B? 

|E[6" — 8"}Il2 < -37 (47) 


Before proving the theorem, we introduce some notation and a few preliminary results. Let g, = 
Vf (8';X;) be the gradient of the r’” sample in stochastic gradient descent, where we consider run- 
ning SGD on a single machine. We also let 


II(v) := argmin {||@ — v3} 
0cO 


denote the projection of the point v onto the domain ©. 
We now state a known result, which gives sharp rates on the convergence of the iterates {6"} in 
stochastic gradient descent. 





Lemma 14 (Rakhlin et al., 2012) Assume that 
c > 1, for any t € N we have 
e [le —e" |] < 


With these ingredients, we can now turn to the proof of Theorem 5. Lemma 14 gives the 
inequality (46), so it remains to prove that ©” has the smaller bound (47) on its bias. To that end, 
recall the neighborhood Up C © in Assumption 5, and note that 











Ellis] < @ for allt. Choosing N; = £ for some 


aG? 


Mt 





where o=4c’. 














git! = Q* = T1(6! -Ng — 0*) 
= 6 — Ng: — 0* + 1 (e+1¢up) (11(6' — Nrg) — (6! —181)) 
since when 0 € Up, we have IT(@) = 


Ejot! im 


©. Consequently, an application of the triangle inequality gives 
— (8 —n:87)) 1" ¢ Up)||9] 
By the definition of the projection and the fact that 6’ € ©, we additionally have 

|| 116" — N18) — (8 —1181)||, < jo - (0 


Thus, by applying Hölder’s inequality (with the conjugate choices (p,q) = (4 
tion 5, we have 























ay 











Nigt 











ell <| o"Illy + Ell| CLG! — ngs) — 
=M181))|]2 <M Isla- 


,4)) and Assump- 


























































































































|E rjo +! — e*]l], < | EJO — nrg: — 0* ll nE| silla Logu) 
ies 3/4 
< |E -ng — 8") ||, +n y elle (E Geyl) 
< ||E[6’ — neg: — 6*]| 2+NG (PO g S 
< ||E[6’ — ng: — 0" roH -o*l j 
< JIE -ng -0l +16 > (48) 
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the inequality (48) following from an application of Markov’s inequality. By applying Lemma 14, 
we finally obtain 























a l aG? \3/4 
Ejo“ '_@ Ill. < | Elo — ng; — 8 Ill, MG (az) 


. co3/4G5/2 1 
= | EJO -ng — 8 Ill 15/2p3/2 47/48 

















(49) 





Now we turn to controlling the rate at which 6’ — 1g; goes to zero. Let f;(-) = f(;X,) be 
shorthand for the loss evaluated on the t’” data point. By defining 


ri = 81 — V fi(8*) — V? f (0*)(0' — 8), 


a bit of algebra yields 
gi = V fi(0*) +V? f,(0*) (0f — 0*) + rr. 


Since 6! belongs to the o-field of X;,...,X;—1, the Hessian V? f; (0*) is (conditionally) independent 
of 6’ and 


















































Dik = V?Fo(6*) EJO‘ — 6*] + lr (acu,)! + Er | (a.gu,)!- (50) 
If 8’ € Up, then Taylor’s theorem implies that r; is the Lagrange remainder 
r: = (VŽ f:(8') — V? f:(8")) (6 — 0%), 


where 6’ = «6‘ + (1 — «)6* for some « € [0,1]. Applying Assumption 5 and Hölder’s inequality, we 
find that since 0’ is conditionally independent of X;, 


| 


























rdev) l] < E [IVAX -VPN — 8", Hocu,)| 
<E|L(X,) ||" -l| =E 


P aLG? 
T t` 























L(X,)]E{||6' — 6* |j; 








— 











< LE | 











jo’ —" || 





On the other hand, when 0’ ¢ Up, we have the following sequence of inequalities: 


t [ioeo] VE EE eu)” 


(ii 
È 4/33 (Elei i] ENVANE + ENIA —6°)15]) EE ¢ Up) 


3/4 






































3/4 



































< 39/44/64 + Gt + HAR (P(0' ¢ Up)) 


(iii) aG2 \ 3⁄4 
< PAA . 
Paor (82) 


Here step (i) follows from Hölder’s inequality (again applied with the conjugates (p,q) = (4, $); 
step (ii) follows from Jensen’s inequality, since (a+b+c)* < 33 (a4 +b + c4); and step (iii) follows 
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from Markov’s inequality, as in the bounds (48) and (49). Combining our two bounds on 7;, we find 
that 





F — Ole? 303/4G3/2(G+HR) 1 
[lirl] < VE + 13/2p3/2 BIE 
By combining the expansion (50) with the bound (51), we find that 
[E0 — ng: — 0*]||, = ||E[(U -nV F(0*)) (0 — 8*) +m]]], 


caLG?  3co3/4G3/2(G+HR) 1 
1372 st 15/293/2 Pai 











(51) 






































< |JE[(—1,V?Fo(6"))(8 — 8")]||, 








Using the earlier bound (49), this inequality then yields 





























Efe" -oll < [|Z -nV F0 ||], EL" - 


ca3/4G3/2 f al ALG! 4G+HR 
Illy A \ 2AA Tt p72 


We now complete the proof via an inductive argument using our immediately preceding bounds. 
Our reasoning follows a similar induction given by Rakhlin et al. (2012). First, note that by strong 
convexity and our condition that ||| V*(8*)||| < H, we have 


|||Z —n:V°F(8*) ||| = 1 — NAmin(V Fo (0*) < 1— mA 


whenever 1 —n:H > 0. Define to = [cH /A]; then for t > to we obtain 























Elo — 6*]||, < (1—c/t) ||E[6" 








wy. 1 cee? [al LG!  4G+HR 
olh + aa 45/2 41/241/4 + p3/2 (52) 


For shorthand, we define two intermediate variables 


co3/4G3/2 Ge oa 

















a, = ||E(6‘ — 6*)||, and b= NJE aI + 5372 


Inequality (52) then implies the inductive relation a,,; < (1 —c/t)a,+b/t’/*. Now we show that 
by defining B = max{toR,b/(c— 1)}, we have a; < B/r3/*. Indeed, it is clear that a; < oR. Using 
the inductive hypothesis, we then have 





(i-c/t)B | b _BE-1) Ble~1I)—b  BE-1) B 
< = < < 
Ari S 3/4 11/4 17/4 2 =A > Gpp 
This completes the proof of the inequality (47). E 


D.0.1 REMARK 


If we assume kth moment bounds instead of 4th, that is, BI |||V7£(0*;X) I2] < HÝ and Elle; l4] < G*, 
we find the following analogue of the bound (52): 












































jeet -= 6"]||, < (1 —e/t) [EW -6l 
k-1 | 2k-2 


1 car Ge | (54/*+1)G+54HR  a!/LG?/ 


pa rE or ' 2/ktl/k 
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In this case, if we define 


cat Ge | (54/F4+1)G454VkAR  ol/kLG2/k 


á rT pe i 22/k 








b 
and p= max fior h, 
c—1 


we have the same result except we obtain the bound ||E[0” — 0*] l2 < B?/ nT. 














Appendix E. Proof of Lemma 6 


We first prove that under the conditions given in the lemma statement, the function F; is (1 — p)A- 
strongly convex over the ball U := {0 € R® : ||}6—6*||, < õp} around 6*. Indeed, fix y € U, then 
use the triangle inequality to conclude that 


[VR w -VRO l < [VFM -y?r OIL + [VF 0) -vre 
sy PA 
<L- >. 
Here we used Assumption 3 on the first term and the fact that the event £; holds on the second. By 
our choice of dp) < pA/4L, this final term is bounded by Ap. In particular, we have 


VR) >A so V7Fi(y) = M— pal = (1—p)Al, 


which proves that F; is (1 — p)A-strongly convex on the ball U. 

In order to prove the conclusion of the lemma, we argue that since Fy is (locally) strongly 
convex, if the function F; has small gradient at the point 6*, it must be the case that the minimizer 
6, of F; is near 0*. Then we can employ reasoning similar to standard analyses of optimality for 
globally strongly convex functions (e.g., Boyd and Vandenberghe, 2004, Chapter 9). By definition 
of (the local) strong convexity on the set U, for any 0’ € ©, we have 





1—p)a 
F; (0') > F, (0*) + (VF; (0*),0' — 0*) 4 ( s min { |[9° s RA Í 
Rewriting this inequality, we find that 


2 92 2 
28} S pj 


2 $ * * / * 
ak [Fi (0’) — Fi (0*) + || VF (6*) ||, |0" —6*||,] - 








min { |[9* -e'l [Fi (0') — Fi (8*) +(VF\(6*), 6’ —6*)] 


Dividing each side by ||6’ — 6*||,, then noting that we may set 6’ = KO; + (1 — x)0* for any « € [0,1], 
we have 





min} e*l, 82 ) < 2 [Fi (x9; +(1 —«)0*) — F (0*)] 2 [VF (0° 


P 
K|[61 — @*|lz K(1—p)A||®; — 6* ||, © (L=p)a 


Of course, Fi (01) < Fı(0*) by assumption, so we find that for any « € (0,1) we have the strict 


inequality 
35 2 || VF) (8* 
in e"||,, P? }< [YF MI <, 





k|[61 — 8, (1—p)A 
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the last inequality following from the definition of £. Since this holds for any x € (0,1), if 
||8; — 8*||, > 5p, we may set x = 6)/||6; — 8*||,, which would yield a contradiction. Thus, we 
have ||6; — 8*||, < õp, and by our earlier inequalities, 

2 2 || VF; (8*) 
(1—p)a (1—p)A 


Dividing by ||6; — 8*||, completes the proof. a 





l0: — || < [Fi (01) — F1 (8°) + || VFi (8%) I> [61 — 6*|lz] < 2 |[81 — 8" lla. 


Appendix F. Moment Bounds 


In this appendix, we state two useful moment bounds, showing how they combine to provide a proof 
of Lemma 7. The two lemmas are a vector and a non-commutative matrix variant of the classical 
Rosenthal inequalities. We begin with the case of independent random vectors: 


Lemma 15 (de Acosta, 1981, Theorem 2.1) Let k > 2 and X; be a sequence of independent ran- 
dom vectors in a separable Banach space with norm ||-|| and E{||X;||"] < œ. There exists a finite 
constant C such that 

7 e| | 


n n 
e| yx, yx, 
i=1 


i=l 
We say that a random matrix X is symmetrically distributed if X and —X have the same distri- 
bution. For such matrices, we have: 


Lemma 16 (Chen et al., 2012, Theorem A.1(2)) Let X; € R¢4 be independent and symmetrically 
distributed Hermitian matrices. Then 


n ky 1/k n 1/2 
e| Lx | < \/2elogd he E Ke) 
i=l i=l 


Equipped with these two auxiliary results, we turn to our proof Lemma 7. To prove the first 
bound (24), let 2 < k < ko and note that by Jensen’s inequality, we have 






























































k ss k/2 
Js (z ra) +P E(x] - 


i=1 i=1 






























































1/k 
+ 2eloga (Emax Xl) | 










































































E[|VFi(0*)I5] < 21 || IVF (6°) [lp — EYF (8°) bal | +2 ELI (6°) Il 














Again applying Jensen’s inequality, E[||V f (0*; X) (31 < G’. Thus by recalling the definition VF; (6*) = 
1 ”_, Vf(0*;X;) and applying the inequality 

















E(IVFi(0")lla] < EVF (0I? < n26, 























we see that Lemma 15 implies E [IVF (0*) j is upper bounded by 









































k/2 
1# 1< 
2C (š Éire) + | VE(IVS(O*X)lla] | +2 "E[IIVF (6°) ILI" 


= 








2k-1 GK 
; 7 k 
ELV f(8"; Xi) [lo] | + 7 




















n iZi 1 


k/2 
ri [1x *: X)| E 
<1 | (3 L EUV EP 
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Applying Jensen’s inequality yields 


























i=l 


k/2 
lJ 1# 
(7.5 21 osx < = Y ELV Asx) Ig"? < Gf, 


completes the proof of the inequality (24). 

The proof of the bound (25) requires a very slightly more delicate argument involving sym- 
metrization step. Define matrices Z; = 4 (V? f (9*; X;) — V7Fo(6*)). Ife; € {+1 } are iid. Rademacher 
variables independent of Z;, then for any integer k in the interval [2,k2], a standard symmetrization 
argument (e.g., Ledoux and Talagrand, 1991, Lemma 6.3) implies that 


kJ 1/k ky 1/k 
|“ <lh" 


Now we may apply Lemma 16, since the matrices €;Z; are Hermitian and symmetrically dis- 
tributed; by expanding the definition of the Z;, we find that 
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e| ye 


i=1 
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Ł €;Z; 
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E [||| V2Fi(8*) — VF") Jane ‘<svioæd||( 3 ye EEX -VRE A) 























+ 4elogd (n=*Emax ||V2 F6; X) NG smth) 
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Since the X; are i.i.d., we have 


|G inven- 



















































































<n e] Fx) -VPRO ] 1/2 
by Jensen’s inequality, since Ilat alis Ila ? for semidefinite A. Finally, noting that 
1 
TE [max 26x- VPRED] < Fe [0e —V?A(0") ||] <a 


completes the proof of the second bound (25). 


Appendix G. Proof of Lemma 12 


The proof follows from a slightly more careful application of the Taylor expansion (21). The starting 
point in our proof is to recall the success events (20) and the joint event £ := £o N £ N £. We 
begin by arguing that we may focus on the case where £ holds. Let C denote the right hand side 
of the equality (37) except for the remainder R3 term. By Assumption 3, we follow the bound (26) 
(with min{ko, k1,k2} > 8) to find that 

















E [1e 01- 8113] = O (R7), 


so we can focus on the case where the joint event £ = £o N £ N Ey does occur. 
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Defining A = 0; — 6* for notational convenience, on £ we have that for some x € [0,1], with 
0’ = (1—«)6; + k6*, 
0 = VF, (0*) + V7F, (0*)A + VŽF (0) (AQA) 
= VF, (0*) + V7Fo(0*)A + V°Fo(6*)(A@A) 
+ (VF; (0*) — V7Fo(6*))A + (V3F; (0’) — V7Fo(6*))(A@A). 





Now, we recall the definition X = V*Fo(0*), the Hessian of the risk at the optimal point, and solve 
for the error A to see that 
A= E IVF (0*)— £! (VF, (0*) —L)A— 27! V7F;(6*)(A@A) 
+271(V3F(6*) — V3F,(0’))(A@A) (53) 
on the event £. As we did in the proof of Theorem 1, specifically in deriving the recursive equal- 
ity (28), we may apply the expansion (23) of A = 6; — 0* to obtain a clean asymptotic expansion of 
A using (53). Recall the definition P = V?Fo(6*) — VF; (0*) for shorthand here (as in the expan- 


sion (23), though we no longer require Q). 
First, we claim that 


I(g)(V?Fo(0*) — VF, (6’))(A@A) = (M?G°/A° + G*L*dlog(d) /A*) R. (54) 


To prove the above expression, we add and subtract V3F;(0*) (and drop 1(¢) for simplicity). We 
must control 


(V3F(8") — VF (8"))(A@A) + (V3Fi(6") —V3F(8'))(A@A). 


To begin, recall that ||u v|||, = Ilev! |||, = |lu||, ||v||,. By Assumption 4, on the event £ we have 
that V3F; is (1/n) Y"_, M(X;)-Lipschitz, so defining M, = (1/n) £}; M(X;), we have 























E [1c ||(V3Fi 6") - V°F (0) (A@A)|f] < E [Mz o -o'li al 














1/4 2 GÉ 


6n? 
by Hölder’s inequality and Lemma 8. The remaining term we must control is the derivative differ- 


ence E|||(V7Fi (6*) — V>Fo(0*))(A@A)||3]. Define the random vector-valued function G = V (F; — 
Fo), and let G; denote its jth coordinate. Then by definition we have 














3/4 
< E [m$] "E [lo -0i < oam 














(V°F (0*) — V>Fo(9*))(A@A) = [AT (V°G,(6"))A zz AT(v26,(0"))4] ER’. 


Therefore, by the Cauchy-Schwarz inequality and the fact that x' Ax < |IA]llz IIx|I3. 
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Applying Lemma 8 yields that E[||A||$] = O(G/(A2n)*). Introducing the shorthand notation 
g(x) := Vf (3x) — VFo(-), we can write 





6") 1S 
@)= LV 


For every coordinate j, the random matrices Vg ;(6*;X;) (i=1,...,n) are i.i.d. and mean zero. By 


Assumption 3, we have |||V7g;(0*:X Dll < < 2L(X;), whence we ivei {||| V2e;(8*:X:) |I|5] ZDTS, 
Applying Lemma 16, we obtain 


= 
e| 





























G KODI < O(1)Lfn log? (d), 


and hence 





: GL? 
E [ J4 dlog(d)n?, 
which implies the desired result (54). From now on, terms of the form R, will have no larger 
constants than those in the equality (54), so we ignore them. 

Now we claim that 

















(VF (8") — V3F9(8"))(A@A)||3| < 00) 


Lir) V Fi (8*)(A@A) = V°F (0*)((£7 VF (0*)) @ (Z| VF{(0*))) + R. (55) 
Indeed, applying the expansion (23) to the difference A = 0; — 0*, we have on £ that 


AQA = (Z!VF,(0*)) @ (2! VF,(0*)) + (2! PA) @ (27! PA) 
— (£7! PA) @ (27! VF, (6*)) — (27! VF;, (0*)) @ (Z| PA), 


We can bound each of the second three outer products in the equality above similarly; we focus on 
the last for simplicity. Applying the Cauchy-Schwarz inequality, we have 




















z [EVRE EPa] < (E [Evra] E [Eee - 8°) | ae 




















From Lemmas 8 and 9, we obtain that 














z [EVA (|3| = O(n?) and E[|[E-'P(@ -6")[[,] = O(n) 














after an additional application of Cauchy-Schwarz for the second expectation. This shows that 
(2! VF (0*)) @(Z'PA) = &, 


and a similar proof applies to the other three terms in the outer product A® A. Using the linearity of 
V3 F (0*), we see that to prove the equality (55), all that is required is that 


L(g) VF (0*) (2! VF, (0*)) 9 (£7 VF, (6*))) = R. (56) 
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For this, we apply Holder’s inequality several times. Indeed, we have 











E [|1 V5Fi(6") (EVF 6") 8 (EVF (6) [| 
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For the final asymptotic bound, we used Equation (26) to bound E[1(¢¢)], used the fact (from As- 
sumption 3) that E[L(X)8] < L8 to bound the term involving V>F,(6*), and applied Lemma 7 to 
control E[||Z~! VF; (6*)||$]. Thus the equality (56) holds, and this completes the proof of the equal- 
ity (55). 

For the final step in the lemma, we claim that 





























—1(g)Z7! (V*F, (0*) —L)A = £~! (V° F, (0*) — E)E !V F (0*) + R. (57) 


To prove (57) requires an argument completely parallel to that for our claim (55). As before, we use 
the expansion (23) of the difference A to obtain that on £, 


—71(V7F,(0*)—X)A 

=L !(V°FR (0*)—Z)E IVA (0*)— E~! (V2, (0*)— E)E IPA. 
Now apply Lemmas 8 and 9 to the final term after a few applications of Hölder’s inequality. To 
finish the equality (57), we argue that 1(ge)} £~! (V°F,(0*) — E)E-'VF (0*) = R, which follows 


exactly the line of reasoning used to prove the remainder (56). 
Applying equalities (54), (55), and (57) to our earlier expansion (53) yields that 


A= lg |- EVF (0*) — 27 (V?°F (0*) — 2)A — 2'V?A (0*) (AQA) 
+E! (V° (0*) — V?F\(6'))(A@A)] +1 (eA 
= -E | VF (0*) +27! (V°F;, (6*) —Z)27! VF, (0*) 
— 7! VF; (6*) (27! VF, (6*)) @ (2! VF (8*))) +R + Leo. 














Finally, the bound (26) implies that E[1 (cc) \|All5] < P(£°)R? = O(n~), which yields the claim. 
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